1. Introduction. A dependence statistic, the Brownian Distance Covariance,
has been proposed for use in dependence measurement and independence
testing: we refer to this contribution henceforth as SR [we also
note the earlier work on this topic of Szekely, Rizzo and Bakirov (2007)].
Some advantages of the authors' approach are that the random variables X
and Y being tested may have arbitrary dimension, taking values in
$\mathbb{R}^p$ and $\mathbb{R}^q$, respectively;
and the test is consistent against all alternatives subject to the conditions
$E\|X\|_p < \infty$ and $E\|Y\|_q < \infty$.

In our discussion we review and compare against a number of related de- 
pendence measures that have appeared in the statistics and machine learning 
literature. We begin with distances of the form of SR, equation (2.2), most 
notably the work of Feuerverger (1993); Kankainen (1995); 
Kankainen and Ushakov (1998); Ushakov (1999), which we describe in Sec- 
tion 2: these measures have been formulated only for the case p = q = 1, 
however. In Section 3 we turn to more recent dependence measures which 
are computed between mappings of the probability distributions $P_x$, $P_y$,
and $P_{xy}$ of X, Y, and (X,Y), respectively, to high dimensional feature
spaces: specifically, reproducing kernel Hilbert spaces (RKHSs). The RKHS 
dependence statistics may be based on the distance [Smola et al. (2007), 
Section 2.3], covariance [Gretton et al. (2005a, 2005b, 2008)], or 
correlation [Dauxois and Nkiet (1998); Bach and Jordan (2002); 
Fukumizu, Bach and Gretton (2007); Fukumizu et al. (2008)] between the
feature mappings, and make smoothness assumptions which can improve 
the power of the tests over approaches relying on distances between the 
unmapped variables. When the RKHSs are characteristic [Fukumizu et al.
(2008); Sriperumbudur et al. (2008)], meaning that the feature mapping
from the space of probability measures to the RKHS is injective, the
kernel-based tests are consistent for all probability measures generating (X,Y).

RKHS-based tests apply on spaces $\mathbb{R}^p \times \mathbb{R}^q$ for arbitrary p and q. In fact,
kernel independence tests are applicable on a still broader range of (possibly
non-Euclidean) domains, which can include strings [Leslie et al. (2002)],
graphs [Gartner, Flach and Wrobel (2003)], and groups [Fukumizu et al.
(2009)], making the kernel approach very general. In Section 4 we provide
an empirical comparison between the approach of SR and the kernel statistic 
of Gretton et al. (2005b, 2008) on an independence testing benchmark. 

2. Characteristic function-based dependence measures. We begin with a 
brief review of characteristic function-based independence measures related 
to the statistic $\mathcal{V}_n^2(X,Y)$ in SR, equation (2.8); see also Ushakov (1999),
Section 3.7. 

Feuerverger (1993) proposes two statistics for independence testing, in
the case where X and Y are univariate. The first, described by Feuerverger
[(1993), Section 4], is

$T'_n := \int\!\!\int |\Gamma'_n(s,t)|^2 \, W(s,t) \, ds \, dt,$

where $W(s,t)$ is a weight function,

$\Gamma'_n(s,t) := f^n_{X',Y'}(s,t) - f^n_{X'}(s) \, f^n_{Y'}(t),$

and $f^n_{X',Y'}$, $f^n_{X'}$, and $f^n_{Y'}$ denote the empirical characteristic functions (in
accordance with the notation of SR); these, however, take as their argument
the approximate normal scores of the sample points,

(2.1)

With an appropriate choice of weight function, Feuerverger obtains the
statistic

where the summation indices denote all r-tuples drawn with replacement
from the set {1,...,n}, and r is the number of indices of the sum. This
statistic takes a form similar to the statistic $\mathcal{V}_n^2(X,Y)$ in SR, equation (2.8),
the main differences being the restriction to the univariate case, use of the
1-norm, and transformation (2.1). A second statistic, described by Feuerverger
[(1993), Section 5], is written

(2.2) $T_n := \int\!\!\int |\Gamma_n(s,t)|^2 \, W(s,t) \, ds \, dt,$

where the term $\Gamma_n(s,t)$ now simply denotes the difference between the joint
characteristic function and the product of the marginals [in other words,
the statistic is identical to that in SR, equation (2.1)]. Feuerverger remarks
that, for certain choices of $W(s,t)$, the resulting statistic resembles that of
Rosenblatt (1975), being the $L_2$ distance between the kernel density estimate
of the joint distribution and that of the product of the marginals. As an
illustration, Kankainen [(1995), page 54] makes this link explicit, employing
a Gaussian weight function to obtain the statistic

(2.3) $T_n = \frac{1}{n^2} \sum_{j,k} k_{jk} l_{jk} - \frac{2}{n^3} \sum_{j,q,r} k_{jq} l_{jr} + \frac{1}{n^4} \sum_{j,k,q,r} k_{jk} l_{qr},$

where

(2.4) $k_{jk} := \exp\left( \frac{-\|X_j - X_k\|^2}{2\sigma^2} \right)$ and $l_{qr} := \exp\left( \frac{-\|Y_q - Y_r\|^2}{2\sigma^2} \right).$

One can readily see that this involves transforming the distances of $\mathcal{V}_n^2(X,Y)$
in SR, equation (2.8), by passing them through a Gaussian distortion: this
replaces the finite expected norm condition required by SR with a weaker
requirement.
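
As an illustration, the three sums in (2.3) can be evaluated directly from the two Gram matrices. The Python sketch below assumes that the Gaussian bandwidth enters as exp(-d^2 / (2 sigma^2)); this parametrization, the function names, and the toy data are assumptions made here for illustration and are not taken from Kankainen (1995). The sketch also checks numerically that the three-term form agrees with trace(KHLH)/n^2, where H = I - 11^T/n is the centring matrix.

```python
import numpy as np


def gaussian_gram(v, sigma):
    """Gram matrix with entries exp(-(v_j - v_k)^2 / (2 * sigma^2))."""
    v = np.asarray(v, dtype=float)
    sq_dists = (v[:, None] - v[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def three_term_statistic(x, y, sigma=1.0):
    """Evaluate the three sums of (2.3) for univariate samples x and y."""
    n = len(x)
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    # (1/n^2) * sum over j,k of k_jk * l_jk
    term_pairs = (K * L).sum() / n ** 2
    # (1/n^3) * sum over j,q,r of k_jq * l_jr, via row sums of K and L
    term_triples = (K.sum(axis=1) * L.sum(axis=1)).sum() / n ** 3
    # (1/n^4) * sum over j,k,q,r of k_jk * l_qr
    term_quads = K.sum() * L.sum() / n ** 4
    return term_pairs - 2.0 * term_triples + term_quads


def centred_trace_statistic(x, y, sigma=1.0):
    """Same quantity written as trace(K H L H) / n^2."""
    n = len(x)
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    return np.trace(K @ H @ L @ H) / n ** 2


rng = np.random.default_rng(0)
x = rng.standard_normal(60)
y = x ** 2 + 0.1 * rng.standard_normal(60)  # depends on x, nearly uncorrelated
print(three_term_statistic(x, y), centred_trace_statistic(x, y))
```

The row-sum form of the middle term avoids an explicit triple loop, so the whole statistic costs O(n^2) operations once the Gram matrices are available.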

A further difference of Kankainen (1995) with respect to Feuerverger 
(1993) is that Kankainen generalizes to the problem of testing mutual inde- 
pendence, although the variables themselves remain univariate. Kankainen 
further enforces scale and location invariance by studentizing each variable. 
Finally, despite their superficial resemblance, a number of important differ- 
ences nonetheless exist between the statistic in (2.2) and that of Rosenblatt 
(1975). Most crucially, the kernel bandwidth is kept fixed for the charac- 
teristic function-based test, rather than decreasing as n rises (a decreasing 
bandwidth is needed to ensure consistency of the kernel density estimates),
resulting in very different forms for the null distribution; and there are more
restrictive conditions on the Rosenblatt-Parzen test statistic [Rosenblatt
(1975), conditions a.1-a.4]. These issues are discussed further by Feuerverger
[(1993), Section 5], and Kankainen [(1995), Section 5.4]. An empirical
comparison of the null distributions resulting from fixed vs. decreasing bandwidth
is provided by Gretton and Györfi (2008).






A. GRETTON, K. FUKUMIZU AND B. K. SRIPERUMBUDUR 



3. RKHS-based dependence measures. We now present a class of depen- 
dence measures (henceforth kernel dependence measures) based on mappings 
of the random variables to reproducing kernel Hilbert spaces, which encode 
features of interest for these variables. We first use Bochner's theorem to 
demonstrate that a subclass of kernel dependence measures is equivalent 
to SR, equation (2.2), under appropriate conditions on the weight function. 
Next, we give an interpretation in terms of covariances between feature space 
mappings, from which we may generalize to broader classes of kernel depen- 
dence measures, including correlations and estimates of the mean square 
contingency. 

3.1. Kernel dependence measures via Bochner's theorem. We describe 
a dependence measure introduced by Gretton et al. (2005b, 2008) which 
constitutes the kernel statistic most closely resembling the characteristic 
function-based statistic of SR, equation (2.2). The present derivation follows 
Smola et al. (2007), Section 2.3. We begin with some necessary terminology 
and definitions. Let $z := (x,y) \in \mathbb{R}^{(p+q)}$, and $\mathcal{H}$ be an RKHS with the
continuous feature mapping $\theta(z) \in \mathcal{H}$ for each $z \in \mathbb{R}^{(p+q)}$, such that the inner
product between the features is given by the positive definite kernel function
$h(z,z') := \langle \theta(z), \theta(z') \rangle_{\mathcal{H}}$. We remark that we never need deal with the feature
representations $\theta(z)$ explicitly (indeed, these may be infinite dimensional):
rather, we express our statistic entirely in terms of the kernel function, which
is the inner product between two such mappings. If we restrict ourselves to
kernels that can be written in terms of the difference of their arguments,
$h(z,z') = \lambda(z - z')$, the following theorem applies [Wendland (2005),
Theorem 6.6].

Theorem 3.1 (Bochner). A continuous function $\lambda: \mathbb{R}^{(p+q)} \to \mathbb{R}$ is
positive definite if and only if it is the Fourier transform of a finite nonnegative
Borel measure $W(u)\,du$ on $\mathbb{R}^{(p+q)}$, that is,

(3.1) $\lambda(z) = \int_{\mathbb{R}^{(p+q)}} e^{-i z^T u} \, W(u) \, du, \quad z \in \mathbb{R}^{(p+q)}.$
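
Theorem 3.1 can be checked numerically in a simple case. The sketch below takes the one-dimensional Gaussian profile lambda(z) = exp(-z^2 / (2 sigma^2)), whose spectral density is the Gaussian W(u) = sigma / sqrt(2 pi) * exp(-sigma^2 u^2 / 2), and approximates the integral in (3.1) by a Riemann sum on a fine grid. The function names, the grid, and the choice of sigma are illustrative assumptions.

```python
import numpy as np


def gaussian_profile(z, sigma):
    """lambda(z) = exp(-z^2 / (2 sigma^2)), a translation invariant kernel profile."""
    return np.exp(-z ** 2 / (2.0 * sigma ** 2))


def gaussian_spectral_density(u, sigma):
    """W(u) for the Gaussian profile: the N(0, 1/sigma^2) density."""
    return sigma / np.sqrt(2.0 * np.pi) * np.exp(-(sigma * u) ** 2 / 2.0)


def fourier_of_spectral_density(z, sigma, half_width=40.0, num=200001):
    """Riemann-sum approximation of the integral of exp(-i z u) W(u) du."""
    u = np.linspace(-half_width / sigma, half_width / sigma, num)
    du = u[1] - u[0]
    integrand = np.exp(-1j * z * u) * gaussian_spectral_density(u, sigma)
    return np.sum(integrand) * du


sigma = 0.7
for z in (0.0, 0.5, 2.0):
    approx = fourier_of_spectral_density(z, sigma)
    print(z, approx.real, gaussian_profile(z, sigma))
```

Because W is nonnegative and integrates to lambda(0) = 1, it is a finite nonnegative measure, in line with the theorem.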

Let us consider the following distance between the joint distribution $P :=
P_{xy}$ and the product of the marginals, $Q := P_x P_y$:

$H = \int |f_P(u) - f_Q(u)|^2 \, W(u) \, du,$

where $f_P$ and $f_Q$ are the characteristic functions for P and Q, respectively.
Assuming further that we can decompose $\lambda(z - z') = k(x - x') \, l(y - y')$ (on
which more below), we can rewrite H as

$H = \int \left( \int e^{i z^T u} \, dP(z) - \int e^{i z^T u} \, dQ(z) \right) \left( \int e^{-i z'^T u} \, dP(z') - \int e^{-i z'^T u} \, dQ(z') \right) W(u) \, du$

$= \int\!\!\int \lambda(z - z') \, dP(z) \, dP(z') - \int\!\!\int \lambda(z - z') \, dP(z) \, dQ(z')$

$\quad - \int\!\!\int \lambda(z - z') \, dQ(z) \, dP(z') + \int\!\!\int \lambda(z - z') \, dQ(z) \, dQ(z')$

$= E\{k(X - X') \, l(Y - Y')\} + E\{k(X - X')\} \, E\{l(Y - Y')\}$

$\quad - 2 E\{E\{k(X - X') | X\} \, E\{l(Y - Y') | Y\}\}.$
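
The final equality uses only the factorizations $Q = P_x P_y$ and $\lambda(z - z') = k(x - x') \, l(y - y')$. A sketch of the term-by-term reduction is given below, where $(X', Y')$ is an independent copy of $(X, Y)$ and $Y''$ is a further independent copy of $Y$ (the symbol $Y''$ is introduced here only for illustration). The two cross terms coincide because $\lambda$ is symmetric, which gives the factor of 2.

```latex
\begin{align*}
\iint \lambda(z - z') \, dP(z) \, dP(z')
  &= E\{k(X - X') \, l(Y - Y')\}, \\
\iint \lambda(z - z') \, dQ(z) \, dQ(z')
  &= E\{k(X - X')\} \, E\{l(Y - Y')\}, \\
\iint \lambda(z - z') \, dP(z) \, dQ(z')
  &= E\{k(X - X') \, l(Y - Y'')\} \\
  &= E\{ E\{k(X - X') \mid X\} \, E\{l(Y - Y'') \mid Y\} \}.
\end{align*}
```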

We call H the Hilbert-Schmidt independence criterion (HSIC). The test
statistic in (2.3) is then interpreted as a biased empirical estimate of H
[an unbiased estimate would replace the V-statistics with U-statistics; see
Gretton et al. (2008)]. We remark at this point that the weight function
$1/(|t|_p^{1+p} \, |s|_q^{1+q})$ is not integrable, hence, Bochner's theorem does not apply
for this choice of $W(u)$. Thus, interpreting the statistic in SR, equation (2.6),
as a kernel statistic is not straightforward.

3.2. Kernel dependence measures via covariance operators. We now ob- 
tain HSIC via a different argument, based on the covariance between feature 
mappings of the variables: we then generalize this to correlation-based
dependence measures, with reference to the statistic $\mathcal{R}^2(X,Y)$ of SR. Our brief
review draws heavily on the overview of Gretton and Györfi (2009), Section
4. Let $\mathcal{F}$ be an RKHS on $\mathbb{R}^p$ with feature map $\phi(x)$ and kernel $k(x,x') :=
\langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$, and $\mathcal{G}$ be a second RKHS on $\mathbb{R}^q$ with kernel $l(\cdot,\cdot)$ and
feature map $\psi(y)$. Following Baker (1973); Fukumizu, Bach and Jordan (2004);
Gretton et al. (2005a); Fukumizu, Bach and Jordan (2009), the cross-covariance
operator $C_{xy}: \mathcal{G} \to \mathcal{F}$ for the measure $P_{xy}$ is defined such that, for all $f \in \mathcal{F}$
and $g \in \mathcal{G}$,

$\langle f, C_{xy} g \rangle_{\mathcal{F}} = E([f(X) - E(f(X))] [g(Y) - E(g(Y))]).$

The cross-covariance operator can be thought of as a generalization of a
cross-covariance matrix between the (potentially infinite dimensional)
feature mappings $\phi(x)$ and $\psi(y)$.

To see how this operator may be used to test independence, we recall 
the following characterization of independence [see, e.g., Jacod and Protter 
(2000), Theorem 10.1e]:






Theorem 3.2. The random variables X and Y are independent if and
only if cov(f(X),g(Y)) = 0 for any pair (f,g) of bounded, continuous
functions.

While the bounded continuous functions are too rich a class to permit the
construction of a covariance-based test statistic on a sample, Fukumizu et al.
(2008); Sriperumbudur et al. (2008) show that when $F$ is the unit ball in a
characteristic$^3$ RKHS $\mathcal{F}$, and $G$ the unit ball in a characteristic RKHS $\mathcal{G}$,
then

$\sup_{f \in F, \, g \in G} E([f(X) - E(f(X))] [g(Y) - E(g(Y))]) = 0 \iff P_{xy} = P_x P_y.$

In other words, the spectral norm of the covariance operator $C_{xy}$ between
characteristic RKHSs is zero only at independence, and is an independence
statistic [Gretton et al. (2005a)]. Rather than the spectral norm, Gretton et al.
(2005b) propose to use the squared Hilbert-Schmidt norm (the sum of the
squared singular values), which has a population expression identical to
HSIC, defined earlier. The RKHS norm implies a smoothness penalty on
the functions f and g [Schölkopf and Smola (2002), Chapter 4], resulting
in $O_p(n^{-1/2})$ convergence of the finite sample estimate: interestingly, this
rate does not depend on the dimensions p and q of X and Y, respectively.
Following Serfling [(1980), Chapter 5], the asymptotic distribution of the
statistic under the alternative hypothesis $H_1$ of dependence is Gaussian,
and the distribution under the null hypothesis $H_0$ of independence is an
infinite weighted sum of independent $\chi^2$ random variables; see Gretton et al.
(2008) for details.
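
For a finite-dimensional feature map the cross-covariance operator is an ordinary matrix, and the relation between the two norms can be seen directly. The sketch below uses a hypothetical feature map v -> (v, v^2) for both variables, chosen only because it is explicit (the associated kernel is not characteristic). It checks that the squared Hilbert-Schmidt (Frobenius) norm of the empirical cross-covariance matrix equals trace(KHLH)/n^2 computed from the corresponding Gram matrices, and that the squared spectral norm is no larger.

```python
import numpy as np


def features(v):
    """Hypothetical explicit feature map v -> (v, v^2) for a univariate sample."""
    v = np.asarray(v, dtype=float)
    return np.column_stack([v, v ** 2])


def empirical_cross_covariance(x, y):
    """Matrix of covariances between feature coordinates of x and of y."""
    Phi, Psi = features(x), features(y)
    Phi_c = Phi - Phi.mean(axis=0)
    Psi_c = Psi - Psi.mean(axis=0)
    return Phi_c.T @ Psi_c / len(x)


def hsic_from_gram(x, y):
    """trace(K H L H) / n^2 with K, L the Gram matrices of the same features."""
    n = len(x)
    K = features(x) @ features(x).T
    L = features(y) @ features(y).T
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    return np.trace(K @ H @ L @ H) / n ** 2


rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = x ** 2 + 0.1 * rng.standard_normal(200)

C = empirical_cross_covariance(x, y)
singular_values = np.linalg.svd(C, compute_uv=False)
spectral_norm = singular_values.max()        # largest singular value
hs_norm_sq = (singular_values ** 2).sum()    # sum of squared singular values
print(spectral_norm, hs_norm_sq, hsic_from_gram(x, y))
```

With a characteristic kernel the feature space is infinite dimensional, so only the Gram-matrix form remains computable; the finite-dimensional case is shown purely to make the operator concrete.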

As long as k and l are characteristic kernels, then $H(P_{xy}; \mathcal{F}, \mathcal{G}) = 0$ iff
X and Y are independent. The Gaussian and Laplace kernels are characteristic
on $\mathbb{R}^p$ [Fukumizu et al. (2008)], and universal kernels [as defined
by Steinwart (2001)] are characteristic on compact domains [Gretton et al.
(2005b), Theorem 6]. Sriperumbudur et al. (2008) provide a simple necessary
and sufficient condition for a translation invariant kernel to be characteristic
on $\mathbb{R}^p$: the Fourier spectrum of the kernel must be supported on the
entire domain. Note that characteristic kernels need not be functions of the
distance between points: an example is the kernel

$k(x,x') = \exp(x^T x' / \sigma)$

from Steinwart [(2001), Section 3, Example 1], which is characteristic on
compact subsets of $\mathbb{R}^p$ since it is universal. Finally, an appropriate choice
of kernels allows testing of dependence in non-Euclidean settings, such as
distributions on groups, graphs, and strings [see, for instance, Gretton et al.
(2008), who described independence testing between text fragments in English
and French, where the null hypothesis was rejected when the French
extracts were translations from the English].

3 The reader is referred to [Fukumizu et al. (2008); Sriperumbudur et al. (2008)] for
conditions under which an RKHS is characteristic. We note here that the Gaussian kernel
on $\mathbb{R}^p$ has this property, and provide further discussion below.

Interestingly, the first RKHS-based independence measures were based on 
the canonical correlation, rather than the covariance: in this respect, they 
more strongly resemble the statistic $\mathcal{R}_n$ of SR. Dauxois and Nkiet (1998)
propose the canonical correlation between variables in a spline-based RKHS 
as a dependence measure, using projection on a finite basis to regularize: 
this dependence measure follows the suggestion of Renyi (1959), but with 
a more restrictive pair of function classes used to compute the correlation 
(rather than the set of all square integrable functions). The variables are 
assumed in this case to be univariate. Likewise, Bach and Jordan (2002) use 
the canonical correlation between RKHS feature mappings as a measure of 
dependence between pairs of random variables. Bach and Jordan employ a 
different regularization strategy, however, which is a roughness penalty on 
the canonical correlates. For an appropriate rate of decay of this regulariza- 
tion with increasing sample size, the empirical estimate of the canonical cor- 
relation converges in probability [Leurgans, Moyeed and Silverman (1993); 
Fukumizu, Bach and Gretton (2007)]. Finally, Fukumizu et al. (2008) pro- 
vide a consistent RKHS-based estimate of the mean-square contingency, 
which is also based on the feature space correlation. This final independence 
measure is asymptotically independent of the kernel choice. When used as 
a statistic in an independence test, this last statistic was found empirically 
to have power superior to the HSIC-based test. 

4. Experiments. In comparing the independence test of SR (henceforth
denoted Dist) with HSIC, we used an artificial benchmark proposed by
Gretton et al. (2008). We tested independence in two, four, and eight
dimensions (i.e., $p \in \{1,2,4\}$ and $p = q =: d$). We reproduce here the data
description of Gretton et al. for ease of reference. First, we generated n samples
of two independent univariate random variables, each drawn at random from
the ICA benchmark densities of Bach and Jordan [(2002), Figure 5]: these
included super-Gaussian, sub-Gaussian, multimodal, and unimodal
distributions, with the common property of zero mean and unit variance. Second,
we mixed these random variables using a rotation matrix parametrized by
an angle $\theta$, varying from 0 to $\pi/4$ (a zero angle meant the data were
independent, while dependence became easier to detect as the angle increased
to $\pi/4$; see the two plots in Figure 1). Third, in the cases d = 2 and d = 4,
independent Gaussian noise of zero mean and unit variance was used to fill
the remaining dimensions, and the resulting vectors were multiplied by
independent random two- or four-dimensional orthogonal matrices, to obtain
random vectors X and Y dependent across all observed dimensions. The
resulting random variables were dependent but uncorrelated.

[Figure 1 panels: Rotation $\theta = \pi/8$; Rotation $\theta = \pi/4$; Samp:128, Dim:1; Samp:128, Dim:2. Horizontal axis of the acceptance-rate panels: Angle ($\times \pi/4$).]

Fig. 1. Top left plots: Example data set for p = q = 1, n = 200, and rotation angles
$\theta = \pi/8$ (left) and $\theta = \pi/4$ (right). In this case, both sources are mixtures of two Gaussians
[source (g) of Bach and Jordan (2002), Figure 5]. We remark that the random variables
appear "more dependent" as the angle $\theta$ increases, although their correlation is always
zero. Remaining plots: Rate of acceptance of $H_0$ for the Dist and HSIC tests. "Samp" is
the number n of samples, and "dim" is the dimension d of x and y.

We investigated sample sizes n = 128, 512, 1024, and 2048. In estimating the test
threshold (i.e., the $1 - \alpha$ quantile of the HSIC and Dist null distributions), we
randomly permuted the Y sample ordering 200 times, and used the
appropriate quantile of the resulting histogram of values. The kernel bandwidths
for HSIC were set to the median distance between samples of the respective
variables. Note that a more sophisticated but computationally costly
approach to bandwidth selection is described by Fukumizu et al. (2008), which
involves matching the closed-form expression for the variance of HSIC with
an estimate obtained by data shuffling.
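
The permutation procedure and the median bandwidth rule described above can be sketched in Python as follows. This is a minimal illustration and not the Matlab implementation used for the reported results: the Gaussian kernel parametrization exp(-d^2 / (2 sigma^2)), the use of the biased HSIC estimate, the function names, and the uniform sources (a simple zero-mean, unit-variance stand-in for the ICA benchmark densities) are all assumptions made here.

```python
import numpy as np


def sq_dists(a):
    """Pairwise squared Euclidean distances between the rows of a."""
    sq_norms = (a ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * a @ a.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding


def gaussian_gram_median(a):
    """Gaussian Gram matrix with bandwidth set to the median pairwise distance."""
    d2 = sq_dists(a)
    sigma = np.median(np.sqrt(d2[np.triu_indices_from(d2, k=1)]))
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic_permutation_test(x, y, alpha=0.05, num_perm=200, seed=0):
    """Return (statistic, threshold, reject) for the biased HSIC estimate."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    K = gaussian_gram_median(x)
    L = gaussian_gram_median(y)
    # Double centring of K: equals H K H with H = I - 11^T / n.
    Kc = K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()
    statistic = (Kc * L).sum() / n ** 2
    null_values = np.empty(num_perm)
    for b in range(num_perm):
        perm = rng.permutation(n)  # reorder the Y sample only
        null_values[b] = (Kc * L[np.ix_(perm, perm)]).sum() / n ** 2
    threshold = np.quantile(null_values, 1.0 - alpha)
    return statistic, threshold, bool(statistic > threshold)


def rotated_sources(n, theta, rng):
    """Rotate two independent uniform sources (zero mean, unit variance) by theta."""
    s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, 2))
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta), np.cos(theta)]])
    mixed = s @ rotation.T
    return mixed[:, :1], mixed[:, 1:]  # X and Y as (n, 1) arrays


rng = np.random.default_rng(1)
for theta in (0.0, np.pi / 8, np.pi / 4):
    x, y = rotated_sources(512, theta, rng)
    print(theta, hsic_permutation_test(x, y))
```

Permuting only the Y ordering breaks any dependence between the paired samples while leaving both marginal distributions unchanged, which is why the permuted statistics approximate the null distribution.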

Results are plotted in Figure 1 (average over 500 independent generations
of the data). The y-intercept on these plots corresponds to the acceptance
rate of the null hypothesis $H_0$ of independence, or 1 - (Type I error), and
should be close to the design parameter of $1 - \alpha = 0.95$. Elsewhere, the plots
indicate acceptance of $H_0$ where the alternative hypothesis $H_1$ of dependence
holds, that is, the Type II error.



4 A Matlab implementation of the HSIC test, including the kernel bandwidth selection 
step, may be downloaded from http://www.kyb.mpg.de/bs/people/arthur/indep.htm. 
The software also includes a faster Gamma approximation to the null distribution. 
We observe dependence becomes easier to detect as $\theta$ increases from 0 to
$\pi/4$, when n increases, and when d decreases. HSIC does as well as or better
than Dist in all experiments, with a particular advantage at low sample sizes. 
In this respect, it appears that the additional smoothing employed by the 
RKHS approach has made the associated independence test more robust. 
Earlier experiments by Gretton et al. (2008) indicate that both HSIC and 
Dist outperform the power-divergence statistic of Read and Cressie (1988) 
on these data. This is unsurprising, since, for higher dimensions, a space 
partitioning approach results in too few samples per bin. 

Acknowledgments. We would like to acknowledge Bernhard Schölkopf
and Alexander Smola for their collaboration on several of the works refer- 
enced in this discussion. 